Adaptive Step-size Policy Gradients with Average Reward Metric

Authors

  • Takamitsu Matsubara
  • Tetsuro Morimura
  • Jun Morimoto
Abstract

In this paper, we propose a novel adaptive step-size approach for policy gradient reinforcement learning. We define a new metric for policy gradients that measures the effect that changes in the policy parameters have on the average reward. Because the metric directly measures this effect, the resulting policy gradient learning employs an adaptive step-size strategy that effectively avoids stagnation caused by the complex structure of the average reward function with respect to the policy parameters. Two algorithms based on the metric are derived as variants of the ordinary and natural policy gradients. Their properties are compared with previously proposed policy gradients through numerical experiments on simple, but non-trivial, 3-state Markov Decision Processes (MDPs). We also show performance improvements over previous methods in on-line learning with more challenging 20-state MDPs.
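To make the abstract's idea concrete, here is a minimal sketch (not the authors' actual algorithm) of a policy-gradient update whose step size shrinks where a small parameter change would move the average reward a lot and grows on plateaus. The names `adaptive_pg_step`, `metric`, and `base_step` are hypothetical; `metric` simply stands in for any positive-definite matrix measuring the sensitivity of the average reward to the parameters, since the paper's actual construction is not reproduced here.

```python
import numpy as np

def adaptive_pg_step(theta, grad, metric, base_step=0.1):
    """One hypothetical adaptive-step-size policy gradient update.

    `metric` is a positive-definite matrix standing in for the paper's
    average-reward metric (its construction is not reproduced here).
    """
    # Quadratic form of the gradient under the metric: a proxy for how
    # strongly a unit parameter step would change the average reward.
    effect = grad @ metric @ grad
    # Small effect (a plateau of the average reward) -> larger step;
    # large effect -> smaller, safer step.
    step = base_step / (np.sqrt(effect) + 1e-8)
    return theta + step * grad
```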


Similar articles

Adaptive Batch Size for Safe Policy Gradients

PROBLEM
  • Monotonically improve a parametric Gaussian policy πθ in a continuous MDP, avoiding unsafe oscillations in the expected performance J(θ).
  • Episodic policy gradient: estimate ∇̂θJ(θ) from a batch of N sample trajectories, then update θ′ ← θ + Λ∇̂θJ(θ).
  • Tune the step size α and the batch size N to limit oscillations. This is not trivial:
      – Λ: trade-off with speed of convergence ← adaptive methods.
      – N: trade-off...
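For concreteness, the sketch below fills in the episodic loop described above with a REINFORCE-style batch estimator for a linear Gaussian policy a ~ N(θᵀs, σ²); the function name and the policy parameterization are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def reinforce_gradient(trajectories, theta, sigma=1.0):
    """Batch REINFORCE estimate of grad J(theta) for a linear Gaussian
    policy a ~ N(theta^T s, sigma^2). Each trajectory is a list of
    (state, action, reward) tuples; the batch has N trajectories."""
    grads = []
    for traj in trajectories:
        ret = sum(r for _, _, r in traj)  # total reward of the trajectory
        # Score function of the Gaussian policy, summed over time steps.
        score = sum(((a - theta @ s) / sigma**2) * s for s, a, _ in traj)
        grads.append(ret * score)
    return np.mean(grads, axis=0)
```

The update θ′ ← θ + α · reinforce_gradient(batch, θ) then matches the rule quoted above; the step size α and the batch size N are exactly the quantities whose tuning the paper studies.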


On the Adaptivity Gap of Stochastic Orienteering

The input to the stochastic orienteering problem [13] consists of a budget B and metric (V, d) where each vertex v ∈ V has a job with a deterministic reward and a random processing time (drawn from a known distribution). The processing times are independent across vertices. The goal is to obtain a nonanticipatory policy (originating from a given root vertex) to run jobs at different vertices, t...


THE EXISTENCE OF A STATIONARY ε-OPTIMAL POLICY FOR A FINITE MARKOV CHAIN

In this paper we investigate the problem of optimal control of a Markov chain with a finite number of states when the control sets are compact in a metric space. The goal of the control is to maximize the average reward per unit step. For the case of finite control and state sets, the existence of a stationary optimal policy was proved in [1] and [2]. In [3]-[5] it was proved that for a contro...


Approximation algorithms for stochastic orienteering

In the Stochastic Orienteering problem, we are given a metric, where each node also has a job located there with some deterministic reward and a random size. (Think of the jobs as being chores one needs to run, and the sizes as the amount of time it takes to do the chore.) The goal is to adaptively decide which nodes to visit to maximize total expected reward, subject to the constraint that the...


The Wavelet Transform-Domain LMS Adaptive Filter Algorithm with Variable Step-Size

The wavelet transform-domain least-mean-square (WTDLMS) algorithm uses a self-orthogonalizing technique to improve the convergence performance of LMS. In the WTDLMS algorithm, the trade-off between the steady-state error and the convergence rate is set by a fixed step-size. In this paper, a WTDLMS adaptive algorithm with variable step-size (VSS) is established. The step-size in each subf...
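As background for this snippet, the sketch below implements one classical variable step-size rule, the Kwong-Johnston recursion μ ← αμ + γe², for a plain time-domain LMS filter; the paper's algorithm instead adapts a step-size per wavelet subband, which is not reproduced here.

```python
import numpy as np

def vss_lms(x, d, taps=8, alpha=0.97, gamma=1e-3,
            mu_min=1e-4, mu_max=0.1):
    """Time-domain LMS with the Kwong-Johnston variable step-size rule.

    x: input signal; d: desired signal (same length as x).
    Returns the final weights and the a priori error sequence."""
    w = np.zeros(taps)
    mu = mu_max
    errors = np.empty(len(x) - taps)
    for n in range(taps, len(x)):
        u = x[n - taps:n][::-1]   # current input regressor
        e = d[n] - w @ u          # a priori error
        w += mu * e * u           # LMS weight update
        # Grow the step size while the error is large, shrink it as the
        # filter converges; clipping keeps the update stable.
        mu = np.clip(alpha * mu + gamma * e**2, mu_min, mu_max)
        errors[n - taps] = e
    return w, errors
```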



Publication date: 2010